install.packages("survival")
library(survival)
Germans Trias i Pujol Research Institute and Hospital (IGTP)
Badalona, Spain
April 4, 2025
1. Introduction
2. Survival function
3. Comparing survival curves
4. Cox regression models
5. Take home messages
Aim: To analyse time-to-event data.
Most common events of interest: death, relapse, development of a specific disease, reintervention.
Survival time: time from some fixed starting point to occurrence of a given event or to lost to follow up.
Follow a subject until the event of interest occurs or is lost to follow-up.
At the end of the study we almost never observe the event of interest in all subjects
Event observed
Event not observed \(\longrightarrow\) Censored observation
Examples
Compare the ability of two different treatments to prolong survival in patients with a particular disease.
Identify risk/protective factors for time to having a stroke.
Study the relationship between diet and the time to the development of a cardiovascular disease in three different dietary groups.
Identify factors associated with time to organ rejection after transplantation.
Survival time (T): time to an event of interest (e.g, time from diagnosis to death).
Survival function: \[S(t) = P (T > t) = P (\mbox{an individual survives longer than t})\]
Interpretation
Probability of surviving beyond t.
Proportion of individuals surviving longer than t.
Probability that the event does not occur before t.
Some properties
It depends on time and it is a non-increasing function.
\(S(0) = 1\)
It is usually described using a graph (survival curve).
Survival data from a sample of \(n\) subjects from the study population should be as follows:
\(t_1, \ldots, t_n\): survival times.
\(\delta_1, \ldots, \delta_n\): censoring indicator, where
This dataset contains survival data on 228 patients with advanced lung cancer from the North Central Cancer Treatment Group
inst time status age sex ph.ecog ph.karno pat.karno meal.cal wt.loss
1 3 306 2 74 1 1 90 100 1175 NA
2 3 455 2 68 1 0 90 90 1225 15
3 3 1010 1 56 1 0 90 90 NA 15
Survival variables
time: Survival time in days
status: censoring status 1=censored, 2=dead
Note that status is coded in a non-standard way. Typically you will see 1=event, 0=censored. Let’s recode it to avoid confusion:
Surv() function creates survival objects from variables time and event.
surv_obj <- Surv(time = lung$time, event = lung$status) # creates a survival object
head(surv_obj, 10) # prints first 10 values [1] 306 455 1010+ 210 883 1022+ 310 361 218 166
It results in a value for each subject: the time followed by + if censored
Event coding: Be sure that the event indicator is correctly encoded
Surv() function accepts events coded as TRUE/FALSE, 1/0 or 2/1.
TRUE, 1, 2 will be read as the event. FALSE, 0, 1 will be read as censored observations.
Non-parametric method to estimate the survival function.
It is one of the most widely used methods in survival analysis.
Results in a step function, where there is a step down each time an event occurs.
The survfit() function estimates survival curves using Kaplan-Meier (KM) method
You must specify a formula with a survival object (left-hand side of the formula) and 1 (right-hand side)
And we can plot the survival curve with ggsurvplot() function from {survminer} package .
We can customise the survival curve
We can get KM survival estimates at specific times using the summary() function on a survfit object
Call: survfit(formula = Surv(time = time, event = status) ~ 1, data = lung)
time n.risk n.event survival std.err lower 95% CI upper 95% CI
365 65 121 0.409 0.0358 0.3447 0.486
730 13 38 0.116 0.0283 0.0716 0.187
1-year probability of survival in this study is 41%.
Lower and upper bounds of the 95% confidence interval (CI) are also displayed.
Another quantity of interest in a survival analysis is the median survival time.
We can calculate the median survival time by printing the survfit object
quantile() functionAim: To compare survival times between two (or more) groups of subjects.
Differences can be illustrated by plotting the Kaplan-Meier curve for each group.
This is not enough to conclude that the groups do or do not have different survival times.
Are these differences significant or just random variations? \(\longrightarrow\) Log-rank test
Is a hypothesis test used to compare survival distributions.
It is non-parametric and provides a comparison of the overall survival of the groups.
It is appropriate when survival curves do not cross.
Test:
We get the log-rank p-value using the survdiff() function.
Do survival functions differ between the sexes?
library(dplyr)
# define factor for sex
lung <- lung |>
mutate(sex = factor(sex, levels=1:2, labels = c("male", "female")))
# specify formula
survdiff(Surv(time, status) ~ sex, data = lung) Call:
survdiff(formula = Surv(time, status) ~ sex, data = lung)
N Observed Expected (O-E)^2/E (O-E)^2/V
sex=male 138 112 91.6 4.55 10.3
sex=female 90 53 73.4 5.68 10.3
Chisq= 10.3 on 1 degrees of freedom, p= 0.001
p-value = 0.001 \(\longrightarrow\) significant difference in overall survival according to sex
We can plot the survival curves for each group with ggsurvplot() function and add the log-rank test.
What if we need to…
Study the effect of a continuous variable?
Analyse more than one variable?
Quantify the effect of a possible risk/protective factor?
\(\longrightarrow\) Cox models
Cox regression models or Cox proportional hazards models examine the effect of several variables on survival.
Both quantitative and qualitative variables can be studied.
They are a useful tool for identifying risk/prognosis factors.
Examples
What is the effect of ethnicity, gender and socioeconomic status on the risk of developing a particular disease?
In cancer patients, is the risk of relapse higher in older patients? Can we quantify this increased risk?
Hazard function: \(h(t)\) represents the hazard of experiencing the event of interest at time t.
Cox model formulation:
\[h(t | X_1, X_2, \ldots, X_p)=h_0(t) \cdot \exp(\beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p)\]
\(X_1, \ldots, X_p\): Variables of interest.
\(h_0(t)\): Baseline hazard function.
\(\beta_1, \ldots, \beta_p\): Regression coefficients of each variable.
\(\beta_1, \ldots, \beta_p\): indicate the magnitude of the effect of the corresponding variable.
For \(i=1,\ldots,p\), the \(p\)-values associated with these estimates correspond to
and they indicate whether the \(i\)-th variable has a statistically significant effect on survival/hazard of the event.
The hazard ratio (HR) for the \(i\)-th variable is \(\exp(\beta_i) = e^{\beta_i}\)
Quantitative variables: Change in the risk of the event occurring for each unit increase in the variable at any t.
Qualitative variables: Change in the risk of the event occurring for each category of the variable with respect to the reference category at any t.
HR interpretation
\(\mbox{Hazard ratio}>1 \rightarrow\) The risk of the event occurring increases.
\(\mbox{Hazard ratio}<1 \rightarrow\) The risk of the event occurring decreases.
\(\mbox{Hazard ratio}=1 \rightarrow\) The risk of the event occurring remains the same.
We can fit a Cox model on R with coxph() function to compare the risk of death between different categories of sex:
We can visualise the model results in a fancy table:
Using tbl_regression() function from {gtsummary} package
Specifying exp=TRUE to show Hazard Ratio.
| Characteristic | HR | 95% CI | p-value |
|---|---|---|---|
| sex | |||
| male | — | — | |
| female | 0.59 | 0.42, 0.82 | 0.001 |
| Abbreviations: CI = Confidence Interval, HR = Hazard Ratio | |||
Male is the reference category \(\longrightarrow\) HR calculated for females relative to males.
p.value=0.001 and 95% CI does not cross 1 \(\longrightarrow\) Sex is associated with death.
\(\mbox{HR}_{\mbox{female}} < 1 \longrightarrow\) Females have lower hazard of death than males.
| Characteristic | HR | 95% CI | p-value |
|---|---|---|---|
| sex | |||
| male | — | — | |
| female | 0.59 | 0.42, 0.82 | 0.001 |
| Abbreviations: CI = Confidence Interval, HR = Hazard Ratio | |||
HR = 0.59 \(\longrightarrow\) The risk of death was 0.59 times lower for females than for males at any t (or the risk of death was 41% lower for females)
\(\mbox{HR}_{\mbox{male}}=\frac{1}{\mbox{HR}_{\mbox{female}}}=\frac{1}{0.59}=1.69 \longrightarrow\) The risk of death is almost 70% higher for men.
| Characteristic | HR | 95% CI | p-value |
|---|---|---|---|
| age | 1.02 | 1.00, 1.04 | 0.042 |
| Abbreviations: CI = Confidence Interval, HR = Hazard Ratio | |||
p.value=0.042 \(\longrightarrow\) Age is associated with the risk of death.
HR > 1 \(\longrightarrow\) If age increases, the risk of death increases.
HR = 1.02 \(\longrightarrow\) At any t, for a 1-year increase, the risk of death is 1.02 times the risk in the previous year.
HR = 1.02 \(\longrightarrow\) An increase of 1 year increases the hazard of death by 2%.
HR for a 5-year increase: \[e^{5 \cdot \beta_{age}} = (e^{\beta_{age}})^5=(\mbox{HR}_{age})^5=1.02^5=1.10\]
Let’s build a Cox model to study the effect of age, sex and ph.ecog on death:
| Characteristic | HR | 95% CI | p-value |
|---|---|---|---|
| age | 1.01 | 0.99, 1.03 | 0.2 |
| sex | |||
| male | — | — | |
| female | 0.58 | 0.41, 0.80 | <0.001 |
| ph.ecog | 1.59 | 1.27, 1.99 | <0.001 |
| Abbreviations: CI = Confidence Interval, HR = Hazard Ratio | |||
With sex and ph.ecog on the model, age has not a significant effect on death
Females have lower hazard of death than males (HR<1)
ph.ecog increases the hazard of death (HR>1)
Cox models assume that the proportional hazards assumption is met for each covariate \(\longrightarrow\) the hazard ratios between groups remain constant over time.
The proportional hazards assumption should be tested:
The proportional hazards assumption in a Cox model can be tested with cox.zph() function.
The null hypothesis of proportionality of hazards for each variable in the model is tested against the hypothesis of non-proportionality of hazards.
chisq df p
age 0.188 1 0.66
sex 2.305 1 0.13
ph.ecog 2.054 1 0.15
GLOBAL 4.464 3 0.22
In our model, all p-values > 0.05 \(\longrightarrow\) the null hypothesis is not rejected.
Our model meets the proportional hazards assumption.
Very sensitive test (specially with high sample sizes) \(\longrightarrow\) take a look at the plots.
Graphical diagnostics using ggcoxzph() function ({survminer} package)
Generates plots of Schoenfeld residuals versus time for each covariate.
Proportional hazards assumes that estimates do not vary much over time.
Deviations from a horizontal line are indicative of non-proportional hazards.
Plot the log-log Kaplan Meier survival estimates against time (or log of time).
Evaluate whether the curves are parallel and do not cross.
Only useful for categorical variables.
Add interaction of the covariate with time.
Stratify for the variable that violates the proportional hazards assumption.
Time-to-event variables cannot be analysed using standard methods.
Kaplan-Meier is the most common method to estimate survival curves.
Survival curves can be compared using the log-rank test.
Cox models are the regression models used to study the association between survival and one or more predictor variables.
Applied Biostatistics Course with R